Michael Clark
Statistician Lead
Consulting for Statistics, Computing and Analytics Research
Advanced Research Computing
The following provides a brief introduction to generalized additive models and some thoughts on getting started within the R environment. It doesn’t assume much more than a basic exposure to regression, and maybe a general idea of R, though not necessarily any particular expertise. The presentation is of a very applied nature, such that the topics build upon the familiar and generalize to the less so, with the hope that one can bring the concepts they are comfortable with to the new material. The audience in mind is a researcher with typical social science training, though the material is not specific to those disciplines alone.
As this document is more conceptual, a basic familiarity with R is all that is needed to follow the code, though there is much to be gained from simple web browsing on R if one needs it. And while it wasn’t the intention starting out, this document could be seen as a vignette for the mgcv package, which is highly recommended.
This document was created within Rstudio and rmarkdown. Last modified 2016-06-12. Original draft August, 2012.
R Info: R version 3.3.0 (2016-05-03) “Supposedly Educational”
Let us begin by considering the standard linear regression model (SLiM) estimated via ordinary least squares (OLS). We have some response/variable we wish to study, and believe it to be some function of other variables.

\[y\sim \mathcal{N}(\mu,\sigma^{2})\]
\[\mu = b_{0}+b_{1}X_{1}+b_{2}X_{2}+\ldots+b_{p}X_{p}\]

Here \(y\) is the variable of interest, assumed to be normally distributed with mean \(\mu\) and variance \(\sigma^2\), and the \(X\)s are the predictor variables/covariates in this scenario. The predictors are multiplied by the coefficients (beta) and summed, giving us the linear predictor, which in this case also directly provides us the estimated fitted values.
And since we’ll be using R and might as well get into that mindset, I’ll give an example of what the R code might look like.
mymod = lm(y ~ x1 + x2, data=mydata)

One of the issues with this model is that, in its basic form, it can be very limiting in its assumptions about the data generating process for the variable we wish to study. Also, while the above is one variant of the manner in which the model is typically presented in introductory texts, it very often will not capture what is going on in a great many data situations.
In that light, we may consider the generalized linear model. Generalized linear models incorporate other types of distributions (of the exponential family), and include a link function \(g(.)\) relating the mean \(\mu\), or stated differently, the estimated fitted values \(E(y)\), to the linear predictor \(X\beta\), often denoted \(\eta\). The general form is thus \(g(\mu) = X\beta\).
\[g(\mu) = \eta = X\beta \\ E(y) = \mu = g^{-1}(\eta)\]
Consider again the typical linear regression. We assume a Gaussian (i.e. Normal) distribution for the response, we assume equal variance for all observations, and that there is a direct link of the linear predictor and the expected value \(\mu\), i.e. \(X\beta = \mu\). In fact the typical linear regression model is a generalized linear model with a Gaussian distribution and identity link function.
To further illustrate the generalization, we consider a distribution other than the Gaussian. In the example to the right, we examine a Poisson distribution for some vector of count data.
\[y \sim \mathcal{P}(\mu) \\
g(\mu) = b_{0}+b_{1}X_{1}+b_{2}X_{2}...+b_{p}X_{p}\]
There is only one parameter to be considered, \(\mu\), since for the Poisson the mean and variance are equal. For the Poisson, the (canonical) link function \(g(.)\), is the natural log, and so relates the log of \(\mu\) to the linear predictor. As such we could also write it in terms of exponentiating the right hand side. \[\mu = e^{b_{0}+b_{1}X_{1}+b_{2}X_{2}...+b_{p}X_{p}}\]
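As a quick sketch in R (using simulated data, since no data set is in play yet; all names here are illustrative), this model corresponds to glm with a Poisson family:

```r
# Simulate count data consistent with the model above
set.seed(123)
x1 = rnorm(500)
x2 = rnorm(500)
y  = rpois(500, lambda = exp(.5 + .3*x1 + .2*x2))

# family = poisson uses the canonical log link by default
mod_pois = glm(y ~ x1 + x2, family = poisson)

# The fitted means equal the exponentiated linear predictor,
# since predict defaults to the link (log) scale
all.equal(unname(fitted(mod_pois)), unname(exp(predict(mod_pois))))
```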
While there is a great deal to further explore with regard to generalized linear models, the point here is simply to note them as a generalization of the typical linear model with which all would be familiar after an introductory course in statistics. As we eventually move to generalized additive models, we can see them as a subsequent step in the generalization.
Now let us make another generalization to incorporate nonlinear forms of the predictors. The form \[y \sim ExpoFam(\mu, \ldots) \\ \mu = E(y) \\ g(\mu) = b_{0} + f(x_{1}) + f(x_{2}) +\ldots+ f(x_{p})\] gives the new setup relating our new, now nonlinear predictor to the expected value, with whatever link function may be appropriate.
So what’s the difference? In short, we are using smooth functions of our predictor variables, which can take on a great many forms, with more detail in the following section. For now, it is enough to note the observed values \(y\) are assumed to be of some exponential family distribution, and \(\mu\) is still related to the model predictors via a link function. The key difference is that the linear predictor now incorporates smooth functions \(f(x)\) of at least some (possibly all) covariates.
In many data situations the relationship between some covariate(s) \(X\) and response \(y\) is not so straightforward. Consider the plot at the right. Attempting to fit a standard OLS regression results in the blue line, which doesn’t capture the relationship very well.
\[ y \sim \mathcal{N}(\mu,\sigma^{2})\\ \mu = b_{0}+b_{1}X_{1}+b_{2}X_{1}^2 \] One common approach we could undertake is to add a transformation of the predictor \(X\), and in this case we might consider a quadratic term such that our model looks something like the above.
We haven’t really moved from the general linear model in this case, but we have a much better fit to the data as evidenced by the graph.
There are still other possible routes we could take. Many are probably familiar with loess (or lowess) regression, if only in the sense that statistical packages will often, either by default or with relative ease, add a nonparametric fit line to a scatterplot (see the scatterplots in the car package for example). By default this is typically a lowess fit.
Take a look at the following figure. For every (sorted) \(x_{0}\) value, we choose a neighborhood around it and, for example, fit a simple regression. As an example, let’s look at \(x_{0} = 3.0\), and choose a neighborhood of 100 X values below and 100 values above.
Now, for just that range we fit a linear regression model and note the fitted value where \(x_{0}=3.0\).
If we now do this for every X value, we’ll have fitted values based on a rolling neighborhood that is relative to each X value being considered. Other options we could throw into this process, and usually do, would be to fit a polynomial regression for each neighborhood, and weight observations the closer they are to the X value, with less weight given to more distant observations.
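The rolling-neighborhood idea can be sketched in a few lines of base R. This minimal version uses made-up data and unweighted simple regressions, omitting the polynomial and weighting refinements just described:

```r
# Toy data: nonlinear truth plus noise
set.seed(123)
x = sort(runif(1000, 0, 6))
y = x + x^2 + rnorm(1000, sd = 3)

# For each point, fit a simple regression to the 100 neighbors on
# each side and keep the fitted value at that point
local_fit = sapply(seq_along(x), function(i) {
  idx = max(1, i - 100):min(length(x), i + 100)
  fit = lm(y[idx] ~ x[idx])
  fitted(fit)[which(idx == i)]
})

plot(x, y, col = 'gray70')
lines(x, local_fit, lwd = 2)
```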
The next plot shows the result from such a fitting process, specifically lowess, or locally weighted scatterplot smoothing. For comparison the regular regression fit is also provided. Even without using a lowess approach, we could have fit a model assuming a quadratic relationship, \(y = x + x^2\), and it would result in a far better fit than the simpler model (\(y\) was in fact constructed as \(x + x^2 +\) noise). While polynomial regression might suffice in some cases, the issue is that nonlinear relationships are generally not specified so easily (Bolker (2008) notes an example of fitting a polynomial to 1970s U.S. Census data that would have predicted a population crash to zero by 2015). One can see a variety of approaches in the next figure, which is a version of that found in Venables and Ripley (2002).
The next figure regards a data set giving a series of measurements of head acceleration in a simulated motorcycle accident. Time is in milliseconds, acceleration in g. Here we have data that are probably not going to be captured with simple transformations of the predictors. We can compare various approaches, and the first is a straightforward cubic polynomial regression, which unfortunately doesn’t help us much. We could try higher orders, which would help, but in the end we will need something more flexible, and generalized additive models can help us in such cases.
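The motorcycle data are available as mcycle in the MASS package, so one can reproduce the cubic polynomial attempt directly:

```r
library(MASS)  # for the mcycle data

# A cubic polynomial fit; the plot shows it is not flexible enough
fit_poly3 = lm(accel ~ poly(times, 3), data = mcycle)

plot(accel ~ times, data = mcycle,
     xlab = 'Time (ms)', ylab = 'Acceleration (g)')
lines(mcycle$times, fitted(fit_poly3), lwd = 2)
```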
Let’s summarize our efforts up to this point.
Linear regression: \[y\sim \mathcal{N}(\mu, \sigma^{2}), \quad \mu = b_{0}+b_{1}X_{1}\]
Polynomial regression: \[y\sim \mathcal{N}(\mu, \sigma^{2}), \quad \mu = b_{0}+b_{1}X_{1}+b_{2}X_{1}^2\]
GLM with quadratic term: \[y\sim \mathcal{N}(\mu, \sigma^{2}), \quad g(\mu) = b_{0}+b_{1}X_{1}+b_{2}X_{1}^2\]
GAM: \[y\sim \mathcal{N}(\mu, \sigma^{2}), \quad g(\mu) = f(X)\]
Now that some of the mystery will hopefully have been removed, let’s take a look at GAMs in action.
The data set has been constructed using average Science scores by country from the Programme for International Student Assessment (PISA) 2006, along with GNI per capita (Purchasing Power Parity, 2005 dollars), Educational Index, Health Index, and Human Development Index from UN data. The key variables are as follows (variable abbreviations in bold):
But even before we get too far, it would be good to know our options in the world of GAMs. Near the end of this document is a list of some packages to be aware of, and there is quite a bit of GAM functionality available within R, even for just plotting (ggplot2 has basic gam functionality for scatterplot smoothing; examples are further in the text). For our purposes we will use mgcv.
The first thing to do is get the data in and do some initial inspections.
d = read.csv('data/pisasci2006.csv')
psych::describe(d)[-1, 1:9]  # univariate descriptives
vars n mean sd median trimmed mad min max
Overall 2 57 473.14 54.58 489.00 477.45 48.93 322.00 563.00
Issues 3 57 469.91 53.93 489.00 474.28 41.51 321.00 555.00
Explain 4 57 475.02 54.02 490.00 478.85 47.44 334.00 566.00
Evidence 5 57 469.81 61.74 489.00 475.11 54.86 288.00 567.00
Interest 6 57 528.16 49.84 522.00 525.72 53.37 448.00 644.00
Support 7 57 512.18 26.08 512.00 512.15 25.20 447.00 569.00
Income 8 61 0.74 0.11 0.76 0.75 0.11 0.41 0.94
Health 9 61 0.89 0.07 0.89 0.89 0.08 0.72 0.99
Edu 10 59 0.80 0.11 0.81 0.80 0.12 0.54 0.99
HDI 11 59 0.81 0.09 0.82 0.81 0.09 0.58 0.94
The scatterplot matrix has quite a bit of information to spend some time with: univariate and bivariate density, as well as loess curves. For our purposes we will ignore the issue regarding the haves vs. have-nots on the science scale and save that for another day. Note the surprising negative correlation between interest and overall score for science, which might remind us of Simpson’s paradox, namely, that what occurs for the individual may not be the same for the group. One will also note that while there is a positive correlation between Income and Overall science scores, it flattens out after an initial rise. Linear fits might be difficult to swallow for the human development subscales in general, but let’s take a closer look.
We can see again that linear fits aren’t going to do so well for some, though it might be a close approximation for interest in science and support for science. Now let’s run the smooth. By default ggplot2 will use a loess smoother for small data sets (i.e. < 1000 observations), but it can use mgcv’s gam function as a smoother instead.
Often in these situations people will perform some standard transformation, such as a log, but it doesn’t help nearly as often as it is used. For example, in this case one can log the overall score, Income, or both, and a linear relation will still not be seen.
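One can check this quickly with the current data; none of the following panels will look linear:

```r
# Raw vs. logged versions of the Income relationship
par(mfrow = c(1, 3))
plot(Overall ~ Income, data = d)
plot(Overall ~ log(Income), data = d)
plot(log(Overall) ~ log(Income), data = d)
par(mfrow = c(1, 1))
```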
We will start with the simple situation of a single predictor. Let’s begin by using a typical linear regression to predict science scores by the Income index. We could use the standard R lm function, but I’ll leave that as an exercise for comparison. We can still do straightforward linear models with the gam function, and again it is important to note that the standard linear model can be seen as a special case of a GAM.
library(mgcv)
mod_lm <- gam(Overall ~ Income, data=d)
summary(mod_lm)
Family: gaussian
Link function: identity
Formula:
Overall ~ Income
Parametric coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 204.32 35.37 5.777 4.32e-07 ***
Income 355.85 46.79 7.606 5.36e-10 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
R-sq.(adj) = 0.518 Deviance explained = 52.7%
GCV = 1504.5 Scale est. = 1448.8 n = 54
What are we getting here? The same thing you get from a regular linear model, because you just ran one. However, there are a couple of things to look at. The coefficient is statistically significant, but it also serves as a reminder that it is usually a good idea to scale predictor variables so that the effect is more meaningful. Here, moving one unit on Income is akin to going from broke to being the richest country. In a more meaningful sense, if we moved from, say, .7 to .8, we’d expect an increase of about 35 points on the science score. We also see the deviance explained (for those more familiar with generalized linear models, this is calculated as (Dev\(_{Null}\)-Dev\(_{Residual}\))/Dev\(_{Null}\); one can verify this by running the same model via glm and using the corresponding values from the summary of the model object), which serves as a generalization of R-squared, and in this case it actually is equivalent to R-squared. Likewise there is the familiar adjusted version of it to account for small sample size and model complexity. The scale estimate is the estimated scale parameter, here the residual sum of squares divided by the residual degrees of freedom (i.e. the residual variance). The GCV score we will save for when we run a GAM.
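As suggested, one can verify the deviance explained by running the same model through glm and computing it from the reported deviances:

```r
mod_glm = glm(Overall ~ Income, data = d)

# Deviance explained: (null deviance - residual deviance) / null deviance;
# compare with summary(mod_lm)$dev.expl
(mod_glm$null.deviance - deviance(mod_glm)) / mod_glm$null.deviance
```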
Let’s now try some nonlinear approaches, keeping in mind that \(\mu=f(x)\). As a point of comparison, we can start by trying a standard polynomial regression, and it might do well enough (this example is pretty much straight from Wood (2006) with little modification). To start we must consider a basis to use, i.e. a set of basis functions \(b_j\) that will be combined to produce \(f(x)\):
\[f(x)=\displaystyle\sum\limits_{j=1}^q b_{j}(x)\beta_{j}\]
To better explain by example, if we had a cubic polynomial the basis is: \(b_1(x) = 1\), \(b_2(x)=x\), \(b_3(x)=x^2\), \(b_4(x)=x^3\), which, when combined, give us our \(f(x)\). For those familiar with the mechanics of linear modeling, this is just our ‘model matrix’, which we can save out with many modeling functions in R, or obtain using the model.matrix function. The following visualization allows us to see the effects in action. It is based on the results extracted from running the model (see ?poly for how to fit a polynomial in R) and obtaining the coefficients (e.g. the first plot represents the intercept of 470.44, the second plot our \(b_2\) coefficient of 289.5 multiplied by Income, and so forth). The bottom plot shows the final fit \(f(x)\), i.e. the linear combination of the basis functions.
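The basis-function view is easy to inspect directly. The columns of the model matrix for a raw cubic polynomial are exactly \(1, x, x^2, x^3\), and multiplying them by the coefficients reproduces the fitted values:

```r
# Raw polynomial basis for Income (poly defaults to orthogonal
# polynomials; raw=TRUE gives the plain powers discussed above)
fit_poly = lm(Overall ~ poly(Income, 3, raw = TRUE), data = d)
B = model.matrix(fit_poly)
head(B)

# f(x) is the linear combination of the basis functions
all.equal(drop(B %*% coef(fit_poly)), fitted(fit_poly),
          check.attributes = FALSE)
```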
At this point we have done nothing we couldn’t do in our regular regression approach, but the take home message is that as we move to GAMs we are going about things in much the same fashion; we are simply changing the nature of the basis, and have a great deal more flexibility in choosing the form.
In the next figure I show the fit using a ‘by-hand’ cubic spline basis (see the appendix and pp. 126-127 in Wood (2006)).
A cubic spline is essentially a connection of multiple cubic polynomial regressions. We choose points of the variable at which to create sections, and these points are referred to as knots. Separate cubic polynomials are fit in each section, and then joined at the knots to create a continuous curve. The graph represents a cubic spline with 8 knots (ten including the endpoints) between the first and third quartiles. The red line uses the GAM functionality within ggplot2’s geom_smooth as a point of comparison.
Let’s now fit an actual generalized additive model using the same cubic spline as our smoothing function. We again use the gam function as before for basic model fitting, but now we are using a function s within the formula to denote the smooth terms. Within that function we also specify the type of smooth, though a default is available. I chose ‘cr’, denoting cubic regression splines, to keep consistent with our previous example.
mod_gam1 <- gam(Overall ~ s(Income, bs="cr"), data=d)
summary(mod_gam1)
Family: gaussian
Link function: identity
Formula:
Overall ~ s(Income, bs = "cr")
Parametric coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 470.444 4.082 115.3 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Approximate significance of smooth terms:
edf Ref.df F p-value
s(Income) 6.895 7.741 16.67 1.59e-14 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
R-sq.(adj) = 0.7 Deviance explained = 73.9%
GCV = 1053.7 Scale est. = 899.67 n = 54
The first thing to note is that, aside from the smooth part, our model code is similar to what we’re used to with core R functions such as lm and glm. In the summary, we first see the distribution assumed as well as the link function used, in this case normal and identity respectively, which, to reiterate, would result in a standard regression model had we done no smoothing. After that we see that the output is separated into parametric and smooth, or nonparametric, parts. In this case, the only parametric component is the intercept, but it’s good to remember that you are not bound to smooth every effect of interest, and indeed, as we will discuss in more detail later, part of the process may involve refitting the model with terms that were found to be linear for the most part anyway. The smooth component of our model, regarding a country’s income’s relationship with overall science score, is statistically significant, but there are a couple of things in the model summary that would be unfamiliar.
We’ll start with the effective degrees of freedom, or edf. In typical OLS regression, the model degrees of freedom is equivalent to the number of predictors/terms in the model. This is not so straightforward with a GAM due to the smoothing process and the penalized regression estimation procedure, something that will be discussed more later (in this example there are actually 9 terms associated with this smooth, but they are each ‘penalized’ to some extent and thus the edf does not equal 9). In this situation, we are still trying to minimize the residual sums of squares, but we also have a built-in penalty for ‘wiggliness’ of the fit, where in general we try to strike a balance between an undersmoothed fit and an oversmoothed fit. The default p-value for the test is based on the effective degrees of freedom and the rank \(r\) of the covariance matrix for the coefficients of a particular smooth, so conceptually it is the p-value associated with \(F(r, n-edf)\). However, there are still other issues to be concerned about, and ?summary.gam will provide your first step down that particular rabbit hole. For hypothesis testing an alternate edf is actually used, which is the other one provided in the summary result (noted Ref.df there; if, for example, the argument p.type = 5 is used, it will instead be labeled Est.Rank; there are four p-value types one can choose from, and the full story of edf, p-values and related is scattered throughout Wood’s text; see also ?anova.gam). At this point you might be thinking these p-values are a bit fuzzy, and you’d be right. The gist is, they aren’t to be used for harsh cutoffs at, say, an arbitrary .05 level (but then, standard p-values shouldn’t be used that way either), but if they are pretty low you can feel comfortable claiming statistical significance, which of course is the end-all, be-all of the scientific endeavor.
The GCV, or generalized cross validation score, can be taken as an estimate of the mean square prediction error based on a leave-one-out cross validation process: we estimate the model for all observations except \(i\), note the squared residual when predicting observation \(i\) from that model, and then do this for all observations. The GCV score is an efficient approximation to this idea that doesn’t actually require fitting all those models, and it overcomes other issues as well. (Since the scale estimate is the deviance divided by the residual degrees of freedom, in this model the GCV can be recovered as \[GCV = \frac{n \cdot \mathrm{scale\ est.}}{n-edf-[\mathrm{n\ of\ parametric\ terms}]}\].) It is this score that is minimized when determining the specific nature of the smooth. On its own it doesn’t tell us much, but we can use it in similar fashion to AIC as a comparative measure to choose among different models, with lower being better.
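Since the scale estimate already divides the deviance by the residual degrees of freedom, the GCV for this model can be computed by hand from components stored in the fitted object and compared with the value mgcv reports:

```r
# scale est. is deviance / residual df, so the GCV formula reduces to
# n * scale / residual df; compare with the stored criterion
n = length(mod_gam1$y)
n * mod_gam1$sig2 / mod_gam1$df.residual  # by hand
mod_gam1$gcv.ubre                         # as reported
```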
One can get a sense of the form of the fit by simply plotting the model object as follows:
par(bg='transparent')
plot(mod_gam1)

The intervals are Bayesian credible intervals based on the posterior predictive distribution. In this single-predictor case one can also revisit the previous graph, where the dark red line was constructed directly from ggplot.
Let us now compare our regular regression fit to the GAM model fit. The following shows how one can extract various measures of performance and the subsequent table shows them gathered together.
AIC(mod_lm)
[1] 550.2449
summary(mod_lm)$sp.criterion
  GCV.Cp 
1504.496 
summary(mod_lm)$r.sq  # adjusted R squared
[1] 0.5175346
Do the same to extract those same elements from the GAM. The following display makes for easy comparison.
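For completeness, the GAM values are extracted in exactly the same way:

```r
AIC(mod_gam1)
summary(mod_gam1)$sp.criterion
summary(mod_gam1)$r.sq  # adjusted R squared
```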
|     | AIC    | GCV     | R2   |
|-----|--------|---------|------|
| LM  | 550.24 | 1504.50 | 0.52 |
| GAM | 529.81 | 1053.73 | 0.70 |
Comparing these various measures, it’s safe to conclude that the GAM fits better. We can also perform the familiar statistical test via the anova function that we apply to other R model objects. As with the previous p-value issue, we can’t be too particular, and technically one could have a model with more terms but lower edf, where the test just wouldn’t even make sense (see ?anova.gam for more information). As it would be best to be conservative, we’ll proceed cautiously.
anova(mod_lm, mod_gam1, test="Chisq")

Analysis of Deviance Table
Model 1: Overall ~ Income
Model 2: Overall ~ s(Income, bs = "cr")
Resid. Df Resid. Dev Df Deviance Pr(>Chi)
1 52.000 75336
2 46.105 41479 5.8951 33857 1.188e-06 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
It would appear the anova results tell us what we have probably come to believe already, that incorporating nonlinear effects has improved the model considerably.
Let’s now see what we can do with a more realistic case where we have added model complexity.
Linear Fit

We’ll start with the typical linear model approach again, this time adding the Health and Education indices.
mod_lm2 <- gam(Overall ~ Income + Edu + Health, data=d)
summary(mod_lm2)
It appears we have statistical effects for Income and Education, but not for Health, and the adjusted R-squared suggests a notable amount of the variance is accounted for. Let’s see about nonlinear effects.
GAM

As far as the generalized additive model goes, we can approach things in a similar manner as before. We will ignore the results of the linear model for now and look for nonlinear effects for each covariate.
mod_gam2 <- gam(Overall ~ s(Income) + s(Edu) + s(Health), data=d)
summary(mod_gam2)
There are again a couple things to take note of. First, statistically speaking, we come to the same conclusion as the linear model regarding the individual effects. One should take particular note of the effect of Health index. The effective degrees of freedom with value 1 suggests that it has essentially been reduced to a simple linear effect. The following will update the model to explicitly model the effect as such, but as one can see, the results are identical.
mod_gam2B = update(mod_gam2, . ~ . - s(Health) + Health)
summary(mod_gam2B)
We can also note that this model accounts for much of the variance in Overall science scores, with an adjusted R-squared of .86. In short, it looks like the living standards and educational resources of a country are associated with overall science scores, even if we don’t really need the Health index in the model.
Graphical Display

Now we examine the effects of interest visually. In the following code, aside from the basic plot we have changed the default options to put all the plots on one page, added the partial residuals, and changed the symbol and its respective size, among other options.
plot(mod_gam2, pages=1, residuals=T, pch=19, cex=0.25,
     scheme=1, col='#FF8000', shade=T, shade.col='gray90')
Here we can see the effects of interest, and one might again note the penalized-to-linear effect of Health. We again see the tapering off of Income’s effect at its highest level, and in addition, a kind of sweet spot for a positive effect of Education in the mid-range values, with a slight positive effect overall. Health, as noted, has been reduced to a linear, surprisingly negative effect, but again this is non-significant. One will also note the y-axis scale. The scale is that of the linear predictor, but due to identifiability constraints, the smooths must sum to zero, and thus are presented in a mean-centered fashion.
Previously, we had only one predictor and so we could let ggplot do the work to produce a plot on the response scale, although we could have added the intercept back in. Here we’ll need to do a little more. The following will produce a plot for Income on the scale of the response, with the other predictors held at their mean. Seen at right.
# Note that mod_gam2$model is the data that was used in the modeling process,
# so it will have NAs removed.
testdata = data.frame(Income = seq(.4, 1, length=100),
                      Edu    = mean(mod_gam2$model$Edu),
                      Health = mean(mod_gam2$model$Health))
fits = predict(mod_gam2, newdata=testdata, type='response', se=T)
predicts = data.frame(testdata, fits)

ggplot(aes(x=Income, y=fit), data=predicts) +
  geom_smooth(aes(ymin = fit - 1.96*se.fit, ymax = fit + 1.96*se.fit),
              fill='gray80', size=1, stat='identity') +
  ggtheme
This gives us a sense for one predictor, but let’s take a gander at Income and Education at the same time. Previously, when we used the plot function, it was actually plot.gam, the basic plot method for a GAM class object, that was being called. There is another plotting function, vis.gam, that will give us a bit more to play with. The following will produce a contour plot (directly to the right) with Income on the x axis, Education on the y, with values on the response scale given by the contours, with lighter color indicating higher values.
vis.gam(mod_gam2, type='response', plot.type='contour')
First and foremost, the figure to the right reflects the individual plots (e.g. one can see the decrease in scores at the highest income levels), and we can see that middling on Education and high on Income generally produces the highest scores. Conversely, being low on both the Education and Income indices is associated with poor Overall science scores. However, while interesting, these respective smooths were created separately of one another, and there is another way we might examine how these two effects work together in predicting the response.
Let’s take a look at another approach, continuing the focus on visual display. It may not be obvious at all, but one can utilize smooths of more than one variable, in effect, a smooth of the smooths of the variables that go into it. Let’s create a new model to play around with this feature. After fitting the model, an example of the plot code is given producing the middle figure, but I also provide a different perspective as well as contour plot for comparison with the graph based on separate smooths.
mod_gam3 <- gam(Overall ~ te(Income, Edu), data=d)
summary(mod_gam3)
vis.gam(mod_gam3, type='response', plot.type='persp',
        phi=30, theta=30, n.grid=500, border=NA)
In the above we are using a type of smooth called a tensor product smooth, and by smoothing the marginal smooths of Income and Education, we see a bit clearer story. As we might suspect, wealthy countries with more of an apparent educational infrastructure are going to score higher on the Overall science score. However, wealth is not necessarily indicative of higher science scores (note the dark bottom right corner on the contour plot), though without at least moderate wealth hopes are fairly dim for a decent score.
One can also, for example, examine interactions between a smooth and a linear term \(f(x)z\), and in a similar vein of thought look at smooths at different levels of a grouping factor. We’ll leave that for some other time.
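Though we leave the details for later, the syntax is worth a glance; mgcv’s by argument covers both cases (the factor group below is hypothetical, not in these data):

```r
# Varying-coefficient model, i.e. f(Income)*Health:
# the smooth of Income is multiplied by the numeric Health variable
mod_by_num = gam(Overall ~ s(Income, by = Health), data = d)

# Separate smooths per level of a hypothetical grouping factor:
# gam(y ~ group + s(x, by = group), data = somedata)
```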
Model Comparison

As before, we can examine indices such as GCV or perhaps adjusted R-squared, both of which suggest our GAM performs considerably better. Statistically, we can compare the two models with the anova function as before.
anova(mod_lm2, mod_gam2, test="Chisq")
Not that we couldn’t have assumed as such already, but now we have additional statistical evidence to suggest that incorporating nonlinear effects improves the model.
Venables, William N., and Brian D. Ripley. 2002. Modern Applied Statistics with S. 4th ed. New York: Springer.

Wood, Simon N. 2006. Generalized Additive Models: An Introduction with R. Boca Raton: Chapman & Hall/CRC.